59 research outputs found

    Articulatory and bottleneck features for speaker-independent ASR of dysarthric speech

    Rapid population aging has stimulated the development of assistive devices that provide personalized medical support to people in need who suffer from conditions of various etiologies. One prominent clinical application is a computer-assisted speech training system that enables personalized speech therapy for patients with communicative disorders in the patient's home environment. Such a system relies on robust automatic speech recognition (ASR) technology to provide accurate articulation feedback. With the long-term aim of developing off-the-shelf ASR systems that can be incorporated in a clinical context without prior speaker information, we compare the ASR performance of speaker-independent bottleneck and articulatory features on dysarthric speech, used in conjunction with dedicated neural network-based acoustic models that have been shown to be robust against spectrotemporal deviations. We report the ASR performance of these systems on two dysarthric speech datasets with different characteristics to quantify the achieved performance gains. Despite the remaining performance gap between dysarthric and normal speech, significant improvements are reported on both datasets using speaker-independent ASR architectures. Comment: to appear in Computer Speech & Language - https://doi.org/10.1016/j.csl.2019.05.002 - arXiv admin note: substantial text overlap with arXiv:1807.1094

    Articulatory representations to address acoustic variability in speech

    The past decade has seen phenomenal improvement in the performance of Automatic Speech Recognition (ASR) systems. In spite of this vast improvement in performance, the state-of-the-art still lags significantly behind human speech recognition. Even though certain systems claim super-human performance, this performance often is sub-par across domains and across datasets. This gap is predominantly due to the lack of robustness against speech variability. Even clean speech is extremely variable due to a large number of factors such as voice characteristics, speaking style, speaking rate, accents, casualness, emotions and more. The goal of this thesis is to investigate the variability of speech from the perspective of speech production, put forth robust articulatory features to address this variability, and to incorporate these features in state-of-the-art ASR systems in the best way possible. ASR systems model speech as a sequence of distinctive phone units like beads on a string. Although phonemes are distinctive units in the cognitive domain, their physical realizations are extremely varied due to coarticulation and lenition which are commonly observed in conversational speech. The traditional approaches deal with this issue by performing di-, tri- or quin-phone based acoustic modeling but are insufficient to model longer contextual dependencies. Articulatory phonology analyzes speech as a constellation of coordinated articulatory gestures performed by the articulators in the vocal tract (lips, tongue tip, tongue body, jaw, glottis and velum). In this framework, acoustic variability is explained by the temporal overlap of gestures and their reduction in space. In order to analyze speech in terms of articulatory gestures, the gestures need to be estimated from the speech signal. The first part of the thesis focuses on a speaker independent acoustic-to-articulatory inversion system that was developed to estimate vocal tract constriction variables (TVs) from speech. 
The mapping from acoustics to TVs was learned from the multi-speaker X-ray Microbeam (XRMB) articulatory dataset. Constriction regions from TV trajectories were defined as articulatory gestures using articulatory kinematics. The speech inversion system combined with the TV kinematics based gesture annotation provided a system to estimate articulatory gestures from speech. The second part of this thesis deals with the analysis of the articulatory trajectories under different types of variability such as multiple speakers, speaking rate, and accents. It was observed that speaker variation degraded the performance of the speech inversion system. A Vocal Tract Length Normalization (VTLN) based speaker normalization technique was therefore developed to address the speaker variability in the acoustic and articulatory domains. The performance of speech inversion systems was analyzed on an articulatory dataset containing speaking rate variations to assess if the model was able to reliably predict the TVs in challenging coarticulatory scenarios. The performance of the speech inversion system was analyzed in cross accent and cross language scenarios through experiments on a Dutch and British English articulatory dataset. These experiments provide a quantitative measure of the robustness of the speech inversion systems to different speech variability. The final part of this thesis deals with the incorporation of articulatory features in state-of-the-art medium vocabulary ASR systems. A hybrid convolutional neural network (CNN) architecture was developed to fuse the acoustic and articulatory feature streams in an ASR system. ASR experiments were performed on the Wall Street Journal (WSJ) corpus. Several articulatory feature combinations were explored to determine the best feature combination. Cross-corpus evaluations were carried out to evaluate the WSJ trained ASR system on the TIMIT and another dataset containing speaking rate variability. 
Results showed that combining articulatory features with acoustic features through the hybrid CNN improved the performance of the ASR system in matched and mismatched evaluation conditions. The findings of this dissertation indicate that articulatory representations extracted from acoustics can be used to address the acoustic variability in speech that arises from speakers, accents, and speaking rates, and can further be used to improve the performance of Automatic Speech Recognition systems.

    Electronic transport properties of DNA sensing nanopores : insight from quantum mechanical simulations

    The translocation of DNA through nanopores is an intensively studied field, as it can lead to a new perspective in DNA sequencing. During this process the DNA is electrophoretically driven through a nanoscale hole in a membrane, and different sensing schemes are used to read out the sequence. Within the scope of nanopore sequencing, two important sensing schemes relevant to this thesis are: (1) tunneling sequencers based on solid-state nanopores embedded with gold electrodes, and (2) 2D materials beyond graphene. For scheme 1, an obvious improvement is to coat the gold electrode with molecules that have high conductance and can form instantaneous hydrogen bond bridges with the translocating polynucleotide, thereby improving the transverse current signal. The molecules that we propose are the so-called diamondoids, which are diamond-caged molecules with hydrogen termination. Before applying such a molecule to a nanopore electrode setup, one would like to understand its interaction with DNA and its nucleobases. For this purpose, hydrogen-bonded complexes formed between nitrogen-doped derivatives of the smallest diamondoids (i.e. adamantane derivatives) and nucleobases were investigated using dispersion-corrected density functional theory (DFT). Mutated and methylated nucleobases are also taken into consideration in these investigations. DFT calculations revealed that the hydrogen bonds are of moderate strength. In addition, starting from the DFT-predicted hydrogen bonding configuration for each complex, rotations and translations along a reference axis were performed to capture variations in the interaction energies along the donor-acceptor groups of the hydrogen bonds. The electronic density of states analysis for the hydrogen-bonded complexes revealed distinguishable signatures for each nucleobase, thereby showing the suitability for application in electrodes functionalised with such probe molecules. 
In the next step, an adamantane derivative is placed on one of the electrodes, and nucleotides are introduced in such a way that the nucleobases form hydrogen bonds with the nitrogen group of the adamantane derivative. Electronic transport calculations were performed for gold electrodes functionalised with three different adamantane derivatives. Four pristine nucleotides, one mutated nucleotide, and one methylated nucleotide were considered. Analysis of the transmission spectra reveals that each of the nucleotides has a unique resonance peak far below the Fermi level. We have also proposed a gating voltage window to sample the resonance peaks of the nucleotides so that they can be distinguished from each other. An alternative to tunneling sequencers would be to use nanopores built into ultra-thin metallic nanoribbons such as graphene. The sequence can be read out from the in-plane current modulation resulting from the local field effect of the translocating nucleotides in the vicinity of the metallic pore edges. However, the hydrophobicity of graphene makes it a difficult candidate in an aqueous environment. Hence, in scheme 2, the aim is to model an ultra-thin material that overcomes the hydrophobicity of graphene and can be a very good candidate for current modulation sequencing. The ultra-thin MoS2 (2H) monolayer exists as a direct band gap semiconductor. Nanopores based on the 2H phase have been reported in the literature and are not hydrophobic. By means of chemical exfoliation of the 2H phase, a metastable 1T phase of MoS2 has also been synthesized by various experimental groups. The 1T phase of MoS2 is metallic. The aim of this thesis is to model a nano-biosensor template based on a hybrid MoS2 monolayer made up of a metallic (1T) phase sandwiched between semiconducting (2H) phases. The sensor that we propose should have only metallic nanopore edges. As a first step, we have modeled the semiconductor-metal interface and compared it with experiments. 
Then, an investigation is performed to understand the influence of increasing the metallic unit on the electronic properties. Since point defects are highly relevant to electrochemical pore growth, a point sulfur defect analysis is provided to ascertain the weakest point in the sheet. Finally, to understand the effect of the interface, electronic transport calculations are performed. The transmission spectra reveal a clear asymmetry in the current flow across the interface by means of gating. In the end, the relevance of such a hybrid MoS2 material for nanopore sequencing is discussed.

    Audio Data Augmentation for Acoustic-to-articulatory Speech Inversion using Bidirectional Gated RNNs

    Data augmentation has proven to be a promising way to improve the performance of deep learning models by adding variability to training data. In previous work on developing a noise-robust acoustic-to-articulatory speech inversion system, we showed the importance of noise augmentation for improving the performance of speech inversion on noisy speech. In this work, we compare and contrast different ways of doing data augmentation and show how this technique improves the performance of articulatory speech inversion not only on noisy speech, but also on clean speech data. We also propose a Bidirectional Gated Recurrent Neural Network as the speech inversion system instead of the previously used feed-forward neural network. The inversion system uses mel-frequency cepstral coefficients (MFCCs) as the input acoustic features and six vocal tract variables (TVs) as the output articulatory features. The performance of the system was measured by computing the correlation between estimated and actual TVs on the U. Wisc. X-ray Microbeam database. The proposed speech inversion system shows a 5% relative improvement in correlation over the baseline noise-robust system for clean speech data. The pre-trained model, when adapted to each unseen speaker in the test set, improves the average correlation by another 6%. Comment: EUSIPCO 202
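    The evaluation described in this abstract (correlation between estimated and actual TV trajectories) could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the abstract does not say which correlation coefficient was used, so the Pearson coefficient is an assumption here, and the function names and the `tv1`/`tv2` track labels are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length trajectories."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("trajectories must have equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / math.sqrt(var_x * var_y)


def mean_tv_correlation(estimated, actual):
    """Score each TV track separately, then average over tracks.

    Both arguments map a (hypothetical) TV name to a list of per-frame values.
    Returns (average correlation, per-track correlations).
    """
    scores = {name: pearson(estimated[name], actual[name]) for name in actual}
    return sum(scores.values()) / len(scores), scores


def relative_improvement(baseline, proposed):
    """Relative change of a score with respect to a baseline score."""
    return (proposed - baseline) / baseline
```

    With six TV tracks per utterance, `mean_tv_correlation` would return one averaged score plus the per-track values. `relative_improvement` shows the arithmetic behind a figure such as "5% relative improvement", assuming that figure refers to the ratio of the gain to the baseline correlation.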